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We develop a highly scalable optimization method called "hierarchical group-thresholding" 
for solving a multi-task regression model with complex structured sparsity constraints on 
both input and output spaces. Despite the recent emergence of several efficient optimization 
algorithms for tackling complex sparsity-inducing regularizers, true scalability in practical 
high-dimensional problems where a huge amount (e.g., millions) of sparsity patterns need to 
be enforced remains an open challenge, because all existing algorithms must deal with ALL 
such patterns exhaustively in every iteration, which is computationally prohibitive. Our 
proposed algorithm addresses the scalability problem by screening out multiple groups of 
coefficients simultaneously and systematically. We employ a hierarchical tree representation 
of group constraints to accelerate the process of removing irrelevant constraints by taking 
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advantage of the inclusion relationships between group sparsities, thereby avoiding dealing 
with all constraints in every optimization step, and necessitating optimization operation 
only on a small number of outstanding coefficients. In our experiments, we demonstrate the 
efficiency of our method on simulation datasets, and in an application of detecting genetic 
variants associated with gene expression traits. 

1. INTRODUCTION 
In this paper, we propose a very efficient optimization technique for multi-task regression 
with structured sparsity. We are interested in the optimization problem with the following 
general form: 

min h\Y - BX\\% + \ ± |B| + A 2 fi i „(B) + A 3 ^ out (B) (1) 

B Z 

where X £ lR JxAr is the input data for J inputs and N samples, Y £ ]R i ^ xAr is the K 
output data (equivalently K tasks), and B £ IR^ xJ is the regression coefficient matrix. 
Here VL in is an norm for inducing group sparsity among correlated inputs (grouping 

effects in the same rows of B) and Q out is an norm for inducing group sparsity among 
correlated outputs (grouping effects in the same columns of B). In this setting, it is possible 
that there exists overlap between/within input and output groups (i.e., a row group and 
a column group may intersect and hence overlap). Note that this formulation subsumes 
popular special cases such as single task lasso, group lasso, etc.. However, throughout this 
paper, we use the formulation in (JTJ, as it explicitly presents a highly general regression 
problem, and one can still use our algorithm for a single task regression problem by setting 
A 3 = and K = 1. 

Unfortunately, problem ([I]) is non-trivial to optimize as it poses two major challenges for 
large scale problems. First, we need to be able to handle a large number of group sparsities 
efficiently. For example, in eQTL mapping problems in bioinformatics, there exist a very 
large number of groups since the number of input and output groups are proportional to K 
(e.g., 2 x 10 4 ) and J (e.g., 5 x 10 5 ), respectively. Second, we need to deal with overlap of 
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groups within and between and Vt out . Note that a simple coordinate descent algorithm 
is not applicable when fij n or Q out is non-separable. 

The second challenge has been addressed by many optimization techniques including 
P El HH El [201 HH H7l [U HOI H21 E]. For example, Jacob et al. |H] proposed to select the 
union of overlapping groups as the support of sparse vectors. In their optimization procedure, 
input variables are duplicated to convert VL in with overlap into the norm with disjoint groups, 
and an optimization technique for group lasso [13] is applied. Jenatton et al. developed 
Structured-Lasso (SLasso) algorithm for sparsity-inducing norms with overlapping groups 
[9]. A smoothing proximal gradient method (SPG) [5] is developed to efficiently deal with 
overlapping group lasso penalty and graph-guided fusion penalty. Also, an efficient algorithm 
based on alternating direction methods [17] was proposed for overlapping group lasso with 
both £1/^2 norm and £i/£oo norm. Recently, fast overlapping group lasso (FoGLasso) [20] 
was proposed for fast optimization of overlapping group lasso problem based on accelerated 
gradient descent method and a proximal operator. 

However, the first challenge is a scalability problem when there exist a very large number 
of (overlapping) groups, and it has been relatively less studied in previous works. For exam- 
ple, the time complexity of smoothing proximal gradient method (SPG) [5] is 0(J2 gm eg \Sm\), 
where Q is a set of groups, and a primal-dual algorithm for overlapping group lasso [14] has 
time complexity of 0(|£y 3 |), where Q is the set of active groups (groups having non-zero 
elements). At each iteration of SLasso algorithm [9], there is an expensive matrix inversion 
operation, and the inner loop of Picard-Nesterov method [1] and FoGLasso [20] have the 
time complexity of 0(J\Q\). As the number of groups in large-scale problems can be very 
large (e.g. 10 6 ), the scalability of existing algorithms could be severely affected by a large 
number of groups. Thus, there is an urgent need to develop an algorithm highly scalable 
to the number of groups, and in this paper, we present a highly efficient algorithm given a 
very large number of (overlapping) groups. Figure [2] illustrates the efficiency of our method 
in comparison to other competitors including FoGLasso, SPG, and SLasso. 

We present a simple and efficient algorithm called hierarchical group-thresholding method 
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(HiGT) to address the scalability problem for overlapping group lasso. We use the following 
optimization strategy. First, we screen a large number of zero groups simultaneously by 
testing the zero condition of multiple groups. We further improved the speed of this step 
by employing a tree data structure where nodes represent the zero patterns encoded by flj n 
and fl ou t at different granularity, and edges indicate the inclusion relations among them. 
Using the tree data structure, we can avoid checking a large number of zero groups. Second, 
given a small number of nonzero groups of coefficients from the previous step, we solve our 
problem using an efficient method for overlapping group lasso. We used FoGLasso for the 
second step. It is also noteworthy that the accuracy of our screening step is not affected by 
the number of overlapping groups as it relies on exact optimality conditions of zero groups. 
Unlike our method, a large number of overlapping groups can degrade the accuracy of some 
approximation approaches (see Figure [21(b)). 

In our experiments, we first evaluate the efficiency of the first step (screening step). Then, 
we demonstrate the performance of our method in terms of the speed and the accuracy for 
the recovery of structured sparsity via simulation study, in comparison to three state-of-the- 
art methods. As an example of biological analysis, we report a novel and significant SNP 
pair identified by our method, and present discussions. 

Remark The problem ([1]) is originally motivated by expression quantitative trait loci 
(eQTLs) mapping in computational biology. Here eQTLs refer to the genomic locations 
or single nucleotide polymorphisms (SNPs) associated with gene expressions. In eQTL map- 
ping problems, it is believed that many inputs (i.e., SNPs) impose small or medium effects 
on outputs (i.e., expression traits), and we usually have J » N (J ~ 10 6 , N ~ 10 3 ) which 
exacerbate the noise to signal ratio. Thus, it is desirable to explore the groups of inputs to 
increase effective signal strength (individual inputs have too small effects to be detected) for 
more accurate causal SNP identification. It is also desirable to perform multi-task learning 
by jointly considering multiple (possibly correlated) responses to decrease the sample size 
required for successful support recovery [15] (the number of samples is too small to detect 
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small signals). Thus, to take advantage of both input groups Qi n and output groups Q out 
simultaneously, we are interested in solving problem ([1]). 

Notations Given a matrix B G M KxJ , we denote the k-th row by (3 k , the j-th column 
by (3 3 , and the (k,j) element by fi 3 k . Given the set of groups Q = {g ml , . . . , g m |g|} defined 
as a subset of the power set of {1, . . . , J}, j3f m represents the row vector with elements 
{Pi : j G g m , g m G Q}. Similarly, for the set of groups H = {hi, . . . , h^} over K rows of 
matrix B, we denote by f3 3 ho the column vector with elements {(3 J k : k G h Q , h Q G %}. We 
also define the submatrix of B| m as a |h„| x |g m | matrix with elements {(3 3 k : k G h G , j G 
g m , h G G H, gm G £}. 

2. MULTI-TASK REGRESSION WITH STRUCTURED SPARSITY 
We use a linear model parametrized by unknown regression coefficients B G IR^ xJ : Y = 
BX+E, where E G IR^^ is i.i.d. Gaussian noise with zero mean and the identity covariance 
matrix. Throughout the paper, we assume that x*s and y l k s are standardized, and consider 
a model without an intercept. 

Suppose that we are given a set of input groups Q and a set of output groups H. We 
consider a multi-task regression model with structured sparsity: 

1 K j 

ima-||Y-BX||5. + A 1 ||diag(ti7 r B)|| 1 + A 2 5] Yl PtWPth + ^T, E »o\\Pih, (2) 

k=i g m ee 3=1 h ew 

where g m G ^ is the mth group of inputs, h D G % is the oth group of outputs, ||/3f m ||2 = 
y / Eje gm (^) 2 ) and Pholla = ^/EfcGhJ/^) 2 - Here individual or groups of coefficients are 
differently penalized with weig hts w G R KxJ , p G and ^ G There may exist overlap 
between groups in Q and groups in H, and within groups in Q or H. Note that B will have 
zero patterns which are the union of groups in Q and % and individual coefficients. The 
supports of B (nonzero (3 3 k s) will be the complement of zero patterns. As the contribution 
of this paper is to propose an efficient optimization method, for simplicity, we assume that 
all weights are set to 1. 



5 



Example We illustrate an example of the penalty used for problem (j2J). Suppose we have 
two inputs and outputs, {xi,x 2 }, {yi,y2}, and B which includes {j3\, /3f, j3\, For the 
input and output groups, we have Q = {gi}, gi = {1,2}, H = {hi} and hi = {1,2}. Under 
this setting, the penalty for problem (T5]) is given by 



n(B)=A!x;£i#i+ A 'E 



k=i j=i 



1 \ 3=1 J'=l \ 



E(^) 2 - ( 3 ) 



k=l 



3. HIERARCHICAL GROUP-THRESHOLDING 
In this section, we propose an efficient method to optimize problem (J2J) referred to as Hierar- 
chical Group-Thresholding (HiGT). Our algorithm consists of two steps. First, We identify 
zero groups by checking optimality conditions (called thresholding) as we walk through a 
predefined hierarchical tree. After walking though the nodes in the tree, some groups of 
coefficients might not achieve zero. Second, we optimize problem ([2]) with only these groups 
of non-zero /3^'s using an efficient optimization technique available for overlapping group 
lasso. 

Let us characterize the zero patterns induced by lijli norms in problem ([2]). We first 
consider a block of B^™ which consists of one input group (g m G Q) and one output group 
(h G %). Since each group can be zero simultaneously (/3f m = 0, (3 J ho = 0), there exist 
zero patterns for B^™ = when /3| m = 0, VA; G h G or f3 3 ho = 0, Vj G g m . Furthermore, the 
union of multiple B^ m, s can generate zero patterns for B§ = which consists of multiple 
input groups and multiple output groups, {h Q } G H and {g m } G G. One might be able to 
check these zero patterns by checking optimality conditions for each f3f m = and (3 J ho = 0. 
However, this approach may be inefficient as it needs to examine a large number of groups. 
Instead, to efficiently check the zero patterns, we will test multiple groups simultaneously 
(i.e., all groups in B^'" or B§). Also, we will construct a hierarchical tree, and exploit the 
inclusion relations between the zero patterns so that we can identify zero groups efficiently 
by traversing the tree while avoiding unnecessary optimality checks. 
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Figure 1: An example of a tree that contains B§, where H = {hi,li2}, and G = {gi,g2}- 
The root node contains zero pattern for B§ = 0, and the leaf nodes represent the zero 
patterns for B{[™ = 0. 

In Figure (TJ we show an example of the tree for B§ when H = {hi,h2}, G = {gi,g2}, 
and |gi| = | g2 1 = | hi | = | h.2 1 = 2. We denote the set of zero patterns of B (i.e., B^'s or 
B§'s) by Z = {Zi, . . . , Z\z\}- For example, Z\ can be a zero pattern for B§ = (the root 
node in Figured]). Let us denote ~B{Z t ) by the coefficients of B corresponding to Z t 's zero 
pattern. Then we define a tree as follows. A node is represented by Z £ Z, and there exists a 
directed edge from Zi £ Z to Z 2 £ Z if and only if Z\ D Z 2 and $Z £ Z : Z\ D Z D Z 2 - Note 
that each layer encodes different granularities of sparsity pattern. When we have multiple 
B§ s, we can generate a subtree for each B§ separately, and then connect all the subtrees 
to the dummy root node for B = 0. 

We can observe that our procedure has the following properties. First, by testing zero 
conditions for each node, we can identify multiple zero groups simultaneously. Second, 
walking through the tree, if B(Z t ) = 0, we know that all the descendants of Z t are also 
zero due to the inclusion relations of the tree. Hence, we can skip to check the optimality 
conditions that the descendants of Z t are zero. 

Considering these properties, we develop our optimization method for the following rea- 
sons. First, if B is sparse, our method is very efficient since we can skip optimality checks 
for many zero patterns in Z. Mostly we will check only nodes located at the high levels of 
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the tree. Second, our method is simple to implement. All we need is to check whether each 
node in the tree attains zero. After identifying zero groups, we solve problem (J2J) with a 
small number of non-zero groups of coefficients using an available optimization technique. 
Specifically, our hierarchical group-thresholding method has the following procedure: 

1. Construct a tree that contains the groups of zero patterns of B (i.e., B^ 1 and B§). 
In our experiments, we used two input and two output groups for each B§, i.e., |H| = 
|G| = 2. 

2. Use depth-first-search (DFS) to traverse the tree, and check optimality conditions to 
see if the zero patterns at each node Z achieve zero. If Z satisfies the optimality 
condition to be zero, skip the descendants of Z, and visit the next node according to 
the DFS order. 

3. With the groups of f3 J k s which did not achieve zero in the previous step, we solve 
problem (j2J) using an available optimization algorithm for overlapping group lasso. We 
used FoGLasso [20] for this step. 

In the next section, we show two main ingredients of our optimization method that include 
1) the construction of a hierarchical tree, and 2) the optimality condition of each Z G Z in 
the tree. 

3.1 Construction of Hierarchical Tree 

Here we consider each B§ separately. We first generate a tree for each B§, and then combine 
them to make a single tree. In each block of B§, we examine the zero patterns of B| m , which 
are included in B§, {g m } G G, {h } G H. These zero patterns of B^™ are shown in the leaf 
nodes in Figure [TJ Note that even though we present a two- level tree throughout this paper, 
one can design a tree with multiple levels. Then we need to determine the edges of the tree 
by investigating the relations of the nodes. Given their relations between B§ and B^™ (i.e., 
B§ D Bf[ m ), we create a directed edge Z\ — > Z 2 . Finally, we make a dummy root node and 
generate an edge from the dummy node to the roots of all subtrees for B§ = 0. 

8 



3.2 Screening Rules for Multiple Groups 

We present rules for checking zero conditions of each node in the tree. We start with 
optimality condition for problem (J2]) by computing a subgradient of its objective function 
with respect to f3 3 k and set it to zero: 

(y fc - f3 k X) ( Xj f = A4 + A 2 c£ + X 3 4, (4) 

where s 3 k , c? k and d k are a subgradient of penalties in problem (J2J) with respect to j3 3 k . 

We first show a rule for identifying B| m = which includes |h Q | output groups and |g m | 
input groups of coefficients. We assume that our algorithm starts with B = 0, and set 
/3 fc X = 0, Vfc. Under the assumption, we can test B^ 1 = separately using Eq. fllj). 



Proposition 1 Bg; = i/ E feeho E jegm M>9) T - Ai4| < A 2 - A 3 yls 



where 



i_ j ^ */|y fe (x,r|<Ai 
si^n (y fc (x j ) T ) i/ |y*(x i ) T | > A x . 



Proof From the optimality condition in (j4]), B^™ = if 

E E W x .) T - A ^i} 2 = E E W4) + a 3 k)} 2 

< Al|h | + Ag|gJ + 2A 2 A 3 E E 

Here we used the fact that £ jegm (4) 2 < 1, E fe6ho K) 2 < 1 and |E fceho E, egm (4)K 
Efceh E jegm (4) 2 EA ; eh E je g ( rf I.) 2 < |gm||h | by Cauchy-Schwarz inequality. The above 



2 

< 



inequality holds since -\/|g m ||h | < ^ keho J2 jeSm (c' k ){d 3 k ). Also, s{ G [-1,1] is deter- 
mined to minimize the left-hand side of the inequality, which is equivalent to applying soft- 
thresholding to Vfc(Xj) T . □ 

Note that Proposition [1] becomes the condition to identify a zero group for overlapping 
group lasso when A3 = and K = 1 (Lemma 2 in [20]). Based on Proposition [TJ we further 
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propose a rule for identifying B§ = 0, {g m } £ G, {h Q } G H as follows: 

B § = 0if E E M x i) T - A i4l< E E |A 2 vW-Wi6j|- (s) 

fceh ,h eH iegmigmSG h„eH g,„eG 

This rule does not guarantee that optimality conditions hold for B^™ = for all h Q G H 
and g m G G. However, in all of our experiments, we observed no violations when |G| = 
|H| = 2, and it was very efficient to identify a large number of zero groups simultaneously. 
Here we give some motivation for this rule. Let us denote ^fceh D Sjeg m |yfc( x i) T ~~ ^ s i\ by 



L mi an d 



Ms/Wo\- ^3 VT&mT by Rom- If L om < R om for all (o,m), this rule is satisfied, 
and it correctly discards groups in B§. Now we claim that if L om > R om for some (o,m) 
(there exist some nonzero blocks, i.e., B^ 1 ^ 0), P^(J2 orn L om < J2 om Rom) is small, and 
we are unlikely to discard nonzero blocks. Suppose L om ~ jV(7, cr) if B^™ = 0, and L om ~ 
Af(r,a) if Bg" 1 7^ 0, where a is a constant, and < 7 < R om < (1 + S/Q)R orn « 
t. Here S 1 and Q are the number of zero and nonzero blocks in B§, respectively, and 
thus S + Q = |H||G|. Then, by Hoeffding's inequality, Pr(£] 0)m L om < ^0,™^°™) ^ 
exp I —2 ^E(X] m L om ) — J2 o m Rorn^j fC | , where C is a constant. We can see that if B§ 7^ 
°> Pr (Eo,m L om < E , m ^m) is n kely to be small since E(Z o , m L °"0 = r< 2 + >> 
(1 + S/Q) sup{R om }Q + 'jS > J2o m Rom- Therefore, the rule in (JSj) would work well when 
S/Q is small since the assumption for r can be weak. However, it should be noted that if 
|G| + |H| is large, S/Q can be very large (S » Q), and the assumption for r becomes too 
strong. As a result, this rule may be violated if we test very large blocks. 

From computational perspective, the rule in (jHJ) significantly decreases the number of 
iterations for identifying zero groups as we can test a block of coefficients consisting of 
multiple input groups and multiple output groups. Note that each test can be performed 
very efficiently by summation of elements in a pre-computed matrix, and the speed for each 
test can potentially be further improved by GPU [2]. 
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4. EXPERIMENTS 

In this section, we show the efficiency and accuracy of our proposed method using simulated 
datasets, and present its usefulness for eQTL mapping, an important application in bioin- 
formatics. We also present comparison between our optimization method and three other 
competitors including Fast overlapping Group Lasso (FoGLasso) [20], Smoothing Proximal 
Gradient method (SPG) [5], and Structured Lasso algorithm (SLasso) [9j. Note that Fo- 
GLasso is a state-of-the-art method for overlapping group lasso, and Yuan et al. showed 
that FoGLasso is significantly faster than other alternative methods [2D]. 

We designed our experiments as follows. In section 14.14 we first present the efficiency of 
screening step in our method for a wide range of tuning parameters. Then, we present the 
speed and accuracy of our method under various settings in comparison to other methods. 
Finally, we confirm the usefulness of our method by showing an interesting interaction effect 
between a pair of genetic variants in yeast that we identified using our method. 

4. 1 Evaluation of Efficiency of Our Method Via Simulation Study 

To systematically evaluate the efficiency of our method, we generated simulated datasets 
as follows. For generating X £ M. JxN , we first selected J input covariates from a uniform 
distribution over [0, 1] for N samples. Then we defined input and output groups as follows. 
For input groups, we selected the size of input groups from a uniform distribution over [5, 10], 
denoted by li(5, 10), and the size of overlapping inputs between two consecutive groups was 
selected from U(l, 4). For output groups, the size of output groups was selected from U(3, 5), 
and the size of overlap with the previous output group was drawn from W(l,2). We then 
simulated B £ IR /<xJ , i.e, the ground-truth coefficients, which includes 52 nonzero coefficients 
(JJ{ = 3). Given X and B, we generated K outputs by Y = BX + E, E ~ A/"(0,I). We 
generated 10 different datasets for each simulation setting with N, K and J, and report the 
average CPU time and average accuracy using Fl score, which is harmonic mean of precision 
and recall rates. Given an estimated B, precision is defined by the ratio of the number of 
correctly found nonzero coefficients to the total number of estimated nonzero coefficients, 
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and recall is denoted by the number of correctly found nonzero coefficients divided by the 
total number of true nonzero coefficients. Throughout all the experiments, we employed a 
two-level tree (excluding the dummy root node), where the nodes at the first level contain 
a block of coefficients consisting of two input groups and two output groups, and the leaf 
nodes include a block of cofficients with one input group and one output group. 

Evaluation of Efficiency of Screening Step Via Simulation Study We first evaluate 
the efficiency of screening step (the first step in Algorithm [TJ) for a range of tuning parameters 
{0.001,0.002,0.005,0.01,0.02,0.05,0.1,0.2,0.5} using simulation datasets with N = 1000, 
J = 5000 and K = 5, and Table [1] shows the results. For simplicity, we set Ai = A2 = A3 
denoted by A. From the table, we can observe that screening time drops significantly as 
A changes from 0.05 to 0.02, which indicates that many coefficients were discarded in the 
first level of our hierarchical tree. Indeed, the number of selected groups was decreased 
from 17071 to 116 without missing true nonzero coefficients. It should be noted that the 
updating time (the second step in Algorithm [TJ) was also substantially reduced when A is 
changed from 0.02 to 0.05 due to the small number of groups selected by screening step. 
Thus, our algorithm became very efficient from A = 0.05 since both screening and updating 
step were very fast. For large tuning parameters (e.g. A > 0.2), we started to miss true 
coefficients, and when A = 0.5, all coefficients were set to zero due to heavy penalization. 
We can observe that A = 0.05 or 0.1 are appropriate for our simulation datasets, and in the 
following experiments, we will use these two tuning parameters. 

Evaluation of Speed and Induced Structured Sparsity Via Simulation Study We 

compared the speed and the accuracy of our HiGT method with the three alternatives of 
FoGLasso, SPG and SLasso. We first show the results under a single task regression setting 
where A3 = 0, and K — 1 (this setting was used in previous papers for FoGLasso, SPG 
and SLasso). Figure |2]^a,b) show CPU time and Fl score of the four methods with different 
number of input variables from 1000 to 20000, fixing N = 1000, K = 1, Xi = X 2 = 0.05, and 
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Table 1: Efficiency of our screening step for a range of tuning parameters. For comparison, 
CPU time for updating step (the second step in Algorithm [1]) is also presented. The fourth 
column denotes the number of groups selected by our screening step (total number of groups: 
19152), and the last column represents the number of true nonzero coefficient discarded by 
our screening step. 



Ai — A2 — A3 


Screening Time (s) 


Updating Time (s) 


# Selected Groups 


# Missing j3{ ^ 


0.001 


0.465 


13.246 


19152 





0.002 


0.482 


13.497 


19152 





0.005 


0.470 


12.515 


19151 





0.01 


0.481 


9.343 


19124 





0.02 


0.476 


6.018 


17071 





0.05 


0.246 


0.022 


116 





0.1 


0.255 


0.010 


51 





0.2 


0.249 


0.003 


16 


20 


0.5 


0.239 








52 



13 




Figure 2: (a) CPU time and (b) Fl score comparison of our proposed HiGT method, Fo- 
GLasso, SPG, and SLasso with different number of input variables under a single task re- 
gression setting. We used simulation datasets with iV = 1000, K = 1, Ai = A 2 = 0.05, and 
A 3 = 0. 

A 3 = 0. We observed that our method was much more scalable than other methods, and 
perfectly recovered true nonzero coefficients. FoGLasso achieved the same accuracy but it 
was not as fast as HiGT due to the lack of hierarchical group screening step. In the following 
comparison analysis, we included only FoGLasso and SPG which showed good performance. 

Figure [3] shows efficiency and Fl score of three methods including HiGT, FoGLasso, and 
SPG under various simulation settings. For all experiments, we set Ai = A 2 = A3 = 0.1. 
From this figure, we can observe the following: 

• For all settings with different number of groups, samples, input and output variables, 
our algorithm was much more efficient than the other methods. 

• Our HiGT algorithm and FoGLasso showed the same Fl score (close to 1) for all 
simulation settings. 

• Screening step in our algorithm never made a mistake for all experiments. 
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(a) (b) (c) (d) 




0.5 1 1 .5 2 2.5 3 1 000 2000 3000 4000 2000 4000 6000 8000 1 0000 1 00 200 300 400 

Number of groups x 1 4 Number of samples Number of input variables Number of output variables 



(e) (f) (g) (h) 

Figure 3: CPU time and Fl score comparison of our proposed HiGT method, FoGLasso, 
and SPG with different (a,e) number of groups (N = 1000), (b,f) samples (J = 500, K = 5), 
(c,g) input variables (N = 1000, K — 5), and (d,h) output variables (N = 1000, J = 150). 
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• In general, the accuracy of HiGT and FoGLasso did not decrease as the problem size 
increased. 

• For all methods, CPU time increased linearly with the number of groups but the 
slopes were significantly different. Our HiGT method has a very small slope due to 
the efficient screening step. 

4.2 Detecting eQTLs Having Interaction Effects in Yeast Genome 

We also solved problem ([2]) using our HiGT method with yeast data [1] which contains 1260 
unique SNPs and observed gene-expression levels of 5637 genes. To show the usefulness of 
our method, we briefly report the most significant eQTLs having interaction effects that we 
identified (chrl:154328-chr5:350744). According to our estimation, it turns out this pair of 
genetic variants affected 455 genes enriched with the GO category of ribosome biogenesis 
with corrected p- value < 10~ 35 . This SNP pair was very closely located on gene NUP60 and 
gene RAD51, respectively, and we found that there exists a significant genetic interaction 
between the two genes [6]. As both SNPs are closely located to NUP60 and RAD51 (within 
500bp), we can assume that the two SNPs affected the two genes (NUP60 and RAD51), 
and their genetic interaction in turn acted on a large number of genes related to ribosome 
biogenesis. It implies that this pair of SNPs can be a truly meaningful biological finding. We 
consider that our detection of this SNP pair is novel as the exact locations of the SNP pair 
were not reported in both Storey et al. [18] and a statistical test for pairwise interactions 

PI- 

5. DISCUSSIONS 

In this paper, we presented an efficient algorithm for a large-scale overlapping group lasso 
problem in highly general settings. Our method relies on a screening step which can efficiently 
discard a large number of irrelevant groups simultaneously. Our simulation confirmed that 
our model is significantly faster than other competitors while maintaining high accuracy. In 
our analysis of yeast eQTL datasets, we reported a pair of genetic variants that potentially 
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interact with each other and influence on ribosome biogenesis. 

One of promising research directions of this work would be to consider parallelization 
of our method. Note that we can naturally parallelize the screening step as it considers a 
set of groups separately. However, the second step of our algorithm needs to be performed 
sequentially after the screening step is completed. A efficiently parallelized algorithm would 
not only further speed up the algorithm but also allow us to deal with very large problems 
which cannot fit into memory. We are also interested in theoretical analysis of our screening 
step in terms of sure screening property for ultra high dimensional problems [7] or the 
properties of strong rules for discarding covariates [19] . Finally, we plan to apply our efficient 
algorithm to very large-scale eQTL mapping problems in bioinformatics for understanding 
the biological mechanisms of complex human diseases. 

REFERENCES 

[1] A. Argyriou, C.A. Micchelli, M. Pontil, L. Shen, and Y. Xu. Efficient first order methods 



for linear composite regularizers. Arxiv preprint arXiv.ll 04 .14-36, 2011 



[2] J. Bolz, I. Farmer, E. Grinspun, and P. Schrooder. Sparse matrix solvers on the gpu: 
conjugate gradients and multigrid. In ACM Transactions on Graphics (TOG), vol- 
ume 22, pages 917-924. ACM, 2003. 

[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and 
statistical learning via the alternating direction method of multipliers. Foundations and 
Trends in Machine Learning, 3:1-124, 2011. 

[4] R.B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene 
expression traits in yeast. PNAS, 102(5):1572, 2005. 

[5] X. Chen, Q. Lin, S. Kim, and E.P. Xing. An efficient proximal-gradient method for 
single and multi-task regression with structured sparsity. Annals of Applied Statistics, 
2010. 



17 



[6] M. Costanzo et al. The genetic landscape of a cell. Science, 327(5964):425, 2010. 

[7] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849- 
911, 2008. 

[8] L. Jacob, G. Obozinski, and J. P. Vert. Group lasso with overlap and graph lasso. In 
Proceedings of the 26th Annual International Conference on Machine Learning, pages 
433-440. ACM, 2009. 

[9] R. Jenatton, J.Y. Audibert, and F. Bach. Structured variable selection with sparsity- 
inducing norms. Journal of Machine Learning Research, 12:2777-2824, 2011. 

[10] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical 
sparse coding. Journal of Machine Learning Research, 12:2297-2334, 2011. 

[11] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for struc- 
tured sparsity. Advances in Neural Information Processing Systems, 2010. 

[12] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow opti- 
mization for structured sparsity. Journal of Machine Learning Research, 12:2681-2720, 
2011. 

[13] L. Meier, S. Van De Geer, and P. Biihlmann. The group lasso for logistic regression. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53-71, 
2008. 

[14] S. Mosci, S. Villa, A. Verri, and L. Rosasco. A primal-dual algorithm for group sparse 
regularization with overlapping groups. In Neural Information Processing Systems, 2010. 

[15] S.N. Negahban and M.J. Wainwright. Simultaneous support recovery in high dimen- 
sions: Benefits and perils of block ^/foo-regularization. Information Theory, IEEE 
Transactions on, 57(6):3841-3863, 2011. 

18 



[16] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M.A.R. Ferreira, D. Bender, J. Mailer, 
P. Sklar, P.I.W. de Bakker, M.J. Daly, et al. PLINK: a tool set for whole-genome 
association and population-based linkage analyses. The American Journal of Human 
Genetics, 81(3):559-575, 2007. 

[17] Z. Qin and D. Goldfarb. Structured sparsity via alternating directions methods. ArXiv 
e-prints, 2011. 

[18] J.D. Storey, J.M. Akey, L. Kruglyak, et al. Multiple locus linkage analysis of genomewide 
expression in yeast. PLoS Biology, 3(8):1380, 2005. 

[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. 
Strong rules for discarding predictors in lasso-type problems. Journal of the Royal 
Statistical Society: Series B (Statistical Methodology), 2011. 

[20] L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. Advances in 
Neural Information Processing Systems, 2011. 



19 



Algorithm 1 Hierarchical Group Thresholding (HiGT) algorithm 



Q <— groups of inputs; H <— groups of outputs 

T(Z,£) a hierarchical tree with groups of zero patterns (see Section HPT) 
{2 ( i),Z (2 ),...,Z(|jr|)}«-DFS order oiZmT(Z,£) 



(1. Screening Step) 

V ^% 
t <- 1 

while t < \Z\ do 

if Z( t ) corresponds to B§ = then 

p «- Rule in ((5J) 
else if Z( t ) corresponds to B^ m = then 

p <— Rule in Proposition Q] 
else 

t <— t + 1; continue; (Skip dummy root node) 
end if 

if p holds (condition for H(Zu)) = 0) then 

t «- DFS order of t' such that Z^ is not a descendant of Z (f ), i' > t and ^i" : t' > t" > t 

(Skip the descendants of Z^) 
else if p = Rule in Proposition Q] then 

V <— T^U groups in Z^ (Keep the groups in Z^) 

t<-t + l 
else 

t<-t + l 
end if 
end while 

(2. Updating Step) 

With the coefficients in V and their corresponding groupings in Q and W, we optimize problem @ using 
an efficient optimization technique for overlapping group lasso (We used FoGLasso [2U] for this step) . 



20 



